NSF PAR Search | NSF Public Access Repository


Defining Replicability of Prediction Rules

https://doi.org/10.1214/23-STS891

Parmigiani, Giovanni (November 2023, Statistical Science)

Full Text Available
Prediction of hereditary cancers using neural networks

https://doi.org/10.1214/21-AOAS1510

Guan, Zoe; Parmigiani, Giovanni; Braun, Danielle; Trippa, Lorenzo (March 2022, The Annals of Applied Statistics)

Full Text Available
Cross-Cluster Weighted Forests

https://doi.org/arXiv:2105.07610

Ramchandran, Maya; Mukherjee, Rajarshi; Parmigiani, Giovanni (May 2021, ArXivorg)
Full Text Available
Optimal ensemble construction for multistudy prediction with applications to mortality estimation

https://doi.org/10.1002/sim.10006

Loewinger, Gabriel; Nunez, Rolando Acosta; Mazumder, Rahul; Parmigiani, Giovanni (February 2024, Statistics in Medicine)

It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out‐of‐study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two‐stage stacking strategy which fits study‐specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model‐fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID‐attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study‐specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID‐19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data‐driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID‐19 data application and in a range of simulation settings.
Receiver operating characteristic curves with an indeterminacy zone

https://doi.org/10.1016/j.patrec.2020.04.035

Parmigiani, Giovanni (August 2020, Pattern Recognition Letters)

Full Text Available
Bayesian Multi-study Factor Analysis for High-throughput Biological Data

https://doi.org/21-AOAS1456

De Vito, Roberta; Bellio, Ruggero; Trippa, Lorenzo; Parmigiani, Giovanni (January 2021, The Annals of Applied Statistics)

This paper presents a new modeling strategy for joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex post, by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient and fast Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios, and outperforms standard factor analysis in all scenarios, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.
Full Text Available
ComBat-seq: batch effect adjustment for RNA-seq count data

https://doi.org/10.1093/nargab/lqaa078

Zhang, Yuqing; Parmigiani, Giovanni; Johnson, W Evan (September 2020, Nucleic acids research)
The benefit of integrating batches of genomic data to increase statistical power is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to effectively address batch effects in genomic data to overcome these challenges. Many existing methods for batch effects adjustment assume the data follow a continuous, bell-shaped Gaussian distribution. However, in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch-adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that the ComBat-seq adjusted data result in better statistical power and control of false positives in differential expression compared to data adjusted by the other available methods. We further demonstrate in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
Full Text Available
Bayesian Combinatorial Multi-Study Factor Analysis

Grabski, Isabella N; De Vito, Roberta; Trippa, Lorenzo; Parmigiani, Giovanni (July 2020, ArXivorg)

Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently, there have been limited methodologies to approach the joint unsupervised analysis of multiple high-dimensional studies. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies, as well as latent factors specific to individual studies. However, BMSFA does not allow for partially shared factors, i.e. latent factors shared by more than one but fewer than all studies. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process. We test our method with an extensive range of simulations, and showcase its utility not only in dimension reduction but also in covariance estimation. Finally, we apply Tetris to high-dimensional gene expression datasets to identify patterns in breast cancer gene expression, both within and across known classes defined by germline mutations.
Full Text Available
Cross-study Learning for Generalist and Specialist Predictions

Ren, Boyu; Patil, Prasad; Dominici, Francesca; Parmigiani, Giovanni; Trippa, Lorenzo (July 2020, ArXivorg)

Jointly using data from multiple similar sources for the training of prediction models is increasingly becoming an important task in many fields of science. In this paper, we propose a framework for generalist and specialist predictions that leverages multiple datasets, with potential heterogeneity in the relationships between predictors and outcomes. Our framework uses ensembling with stacking, and includes three major components: 1) training of the ensemble members using one or more datasets, 2) a no-data-reuse technique for stacking weights estimation and 3) task-specific utility functions. We prove that under certain regularity conditions, our framework produces a stacked prediction function with oracle property. We also provide analytically the conditions under which the proposed no-data-reuse technique will increase the prediction accuracy of the stacked prediction function compared to using the full data. We perform a simulation study to numerically verify and illustrate these results and apply our framework to predicting mortality based on a collection of variables including long-term exposure to common air pollutants.
Full Text Available
Representation via Representations: Domain Generalization via Adversarially Learned Invariant Representations

Deng, Zhun; Ding, Frances; Dwork, Cynthia; Hong, Rachel; Parmigiani, Giovanni; Patil, Prasad; Sur, Pragya (June 2020, ArXivorg)

We investigate the power of censoring techniques, first developed for learning fair representations, to address domain generalization. We examine adversarial censoring techniques for learning invariant representations from multiple "studies" (or domains), where each study is drawn according to a distribution on domains. The mapping is used at test time to classify instances from a new domain. In many contexts, such as medical forecasting, domain generalization from studies in populous areas (where data are plentiful), to geographically remote populations (for which no training data exist) provides fairness of a different flavor, not anticipated in previous work on algorithmic fairness. We study an adversarial loss function for k domains and precisely characterize its limiting behavior as k grows, formalizing and proving the intuition, backed by experiments, that observing data from a larger number of domains helps. The limiting results are accompanied by non-asymptotic learning-theoretic bounds. Furthermore, we obtain sufficient conditions for good worst-case prediction performance of our algorithm on previously unseen domains. Finally, we decompose our mappings into two components and provide a complete characterization of invariance in terms of this decomposition. To our knowledge, our results provide the first formal guarantees of these kinds for adversarial invariant domain generalization.
Full Text Available
